Abstract. The semantic image segmentation task presents a trade-off 
between test-time accuracy and training-time annotation cost. Detailed 
per-pixel annotations enable training accurate models but are very time- 
consuming to obtain; image-level class labels are an order of magnitude 
cheaper but result in less accurate models. We take a natural step from 
image-level annotation towards stronger supervision: we ask annotators 
to point to an object if one exists. We incorporate this point supervision 
along with a novel objectness potential in the training loss function of a 
CNN model. Experimental results on the PASCAL VOC 2012 benchmark 
reveal that the combined effect of point-level supervision and objectness 
potential yields an improvement of 12.9% mIOU over image-level 
supervision. Further, we demonstrate that models trained with point-level 
supervision are more accurate than models trained with image-level, 
squiggle-level or full supervision given a fixed annotation budget. 

Keywords: semantic segmentation, weak supervision, data annotation 


1 Introduction 


At the forefront of visual recognition is the question of how to effectively teach 
computers new concepts. Algorithms trained from carefully annotated data enjoy 
better performance than their weakly supervised counterparts, yet obtaining 
such data is very time-consuming. 

It is particularly difficult to collect training data for semantic segmentation, 
i.e., the task of assigning a class label to every pixel in the image. Strongly 
supervised methods require a training set of images with per-pixel annotations 
[3,8,9,10,11,12] (Fig. 1). Providing an accurate outline of a single object 
takes between 54 seconds and 79 seconds [5]. A typical indoor scene contains 
23 objects [14], raising the annotation time to tens of minutes per image. 
Methods have been developed to reduce the annotation time through effective 
interfaces [5,15,16,17,18,19], e.g., through requesting human feedback only as 
necessary. Nevertheless, accurate per-pixel annotations remain costly and 
scarce. 

Fig. 1. Semantic segmentation models trained with our point-level supervision are 
much more accurate than models trained with image-level supervision (and even more 
accurate than models trained with full pixel-level supervision given the same annotation 
budget). The second two columns show test time results. 


To alleviate the need for large-scale detailed annotations, weakly supervised 
semantic segmentation techniques have been developed. The most common setting 
is where only image-level labels for the presence or absence of classes are 
provided during training [4,20,21,22,23,24,25], but other forms of weak 
supervision have been explored as well, such as bounding box annotations, eye 
tracks [26], free-form squiggles, or noisy web tags. These methods 
require significantly less annotation effort during training, but are not able to 
segment new images nearly as accurately as fully supervised techniques. 

In this work, we take a natural step towards stronger supervision for semantic 
segmentation at negligible additional time, compared to image-level labels. The 
most natural way for humans to refer to an object is by pointing: “That cat over 
there” (point) or “What is that over there?” (point). Psychology research has 
indicated that humans point to objects in a consistent and predictable way [3,28]. 
The fields of robotics and human-computer interaction [9] have long used 
pointing as an effective means of communication. However, point annotation is 
largely unexplored in semantic segmentation. 

Our primary contribution is a novel supervision regime for semantic 
segmentation based on humans pointing to objects. We extend a state-of-the-art 
convolutional neural network (CNN) framework for semantic segmentation [5,23] 
to incorporate point supervision in its training loss function. With just one 
annotated point per object class, we considerably improve semantic segmentation 
accuracy. We ran an extensive human study to collect these points on the PASCAL 
VOC 2012 dataset and evaluate the annotation times. We also make the 
user interface and the annotations available to the community. 

One lingering concern with supervision at the point level is that it is difficult 
to infer the full extent of the object. Our secondary contribution is 
incorporating a generic objectness prior directly in the loss to guide the training 
of a CNN. This prior helps separate objects (e.g., car, sheep, bird) from 
background (e.g., grass, sky, water), by providing a probability that a pixel belongs 
to an object. Such priors have been used in segmentation literature for selecting 
image regions to segment, as unary potentials in a conditional random field 
model, or during inference [25]. However, to the best of our knowledge, we 
are the first to employ this directly in the loss to guide the training of a CNN. 

The combined effect of our contributions is a substantial increase of 12.9% 
mean intersection over union (mIOU) on the PASCAL VOC 2012 dataset [32] 
compared to training with image-level labels. Further, we demonstrate that 
models trained with point-level supervision outperform models trained with 
image-level, squiggle-level, and full supervision by 2.7-20.8% mIOU given a fixed 
annotation budget. 

2 Related Work 

Types of Supervision for Semantic Segmentation. To reduce the up-front 
annotation time for semantic segmentation, recent works have focused on 
training models in a weakly- or semi-supervised setting. Many forms of supervision 
have been explored, such as eye tracks [26], free-form squiggles, noisy web 
tags, size constraints on objects [6], or heterogeneous annotations [33]. 
Common settings are image-level labels [4,23,25] and bounding boxes [4,34]. [14,35,36] 
use co-segmentation methods trained from image-level labels to automatically 
infer the segmentations. [6,23,25] train CNNs supervised only with image-level 
labels by extending the Multiple-Instance Learning (MIL) framework for semantic 
segmentation. [4,34] use an EM procedure, which alternates between estimating 
pixel labels from bounding box annotations and optimizing the parameters of a 
CNN. 

There is a trade-off between annotation time and accuracy: models trained 
with higher levels of supervision perform far better than weakly-supervised 
models, but require large strongly-supervised datasets, which are costly and scarce. 
We propose an intermediate form of supervision, using points, which adds 
negligible additional annotation time to image-level labels, yet achieves better 
results. One prior method also uses point supervision during training, but it 
trains a patch-level CNN classifier to serve as a unary potential in a CRF, 
whereas we use point supervision directly during CNN training. 

CNNs for Segmentation. Recent successes in semantic segmentation have 
been driven by methods that train CNNs originally built for image classification 
to assign semantic labels to each pixel in an image. One extension of 
the fully convolutional network (FCN) architecture developed by [5] is to train 
a multi-layer deconvolution network end-to-end [38]. More inventive forms of 
post-processing have also been developed, such as combining the responses at 
the final layer of the network with a fully-connected CRF. We develop our 
approach on top of the basic framework common to many of these methods. 

Interactive Segmentation. Some semantic segmentation methods are 
interactive, in that they collect additional annotations at test time to refine the 
segmentation. These annotations can be collected as points [2] or free-form 
squiggles. These methods require additional user input at test time; in contrast, 
we only collect user points once and only use them at training time. 

3 Semantic Segmentation Method 

We describe here our approach to using point-level supervision (Fig. 2) for 
training semantic segmentation models. In Section 4 we will demonstrate that this 
level of supervision is cheap and efficient to obtain. In our setting (in contrast 
to [2]), supervised points are only provided on training images. The learned 
model is then used to segment test images with no additional human input. 

[Figure 2 panels. Top: original image, FCN [5], segmentation. Bottom: levels of 
supervision (full, image-level, point-level) and objectness prior.] 

Fig. 2. (Top): Overview of our semantic segmentation training framework. (Bottom): 
Different levels of training supervision. For full supervision, the class of every pixel is 
provided. For image-level supervision, the class labels are known but their locations are 
not. We introduce point-level supervision, where each class is only associated with one 
or a few pixels, corresponding to humans pointing to objects of that class. We include 
an objectness prior in our training loss function to accurately infer the object extent. 

Current state-of-the-art semantic segmentation methods [4,5,23,25,37], both 
supervised and unsupervised, employ a unified CNN framework. These networks 
take as input an image of size W x H and output a W x H x N score map, 
where N is the number of classes the CNN was trained to recognize (Fig. 2). At test 
time, the score map is converted to per-pixel predictions of size W x H by either 
simply taking the maximally scoring class at each pixel [5,23] or employing more 
complicated post-processing [4,25,37]. 
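
The simpler of the two test-time conversions can be illustrated with a minimal sketch. It assumes the score map is stored as a nested list indexed as scores[y][x][c]; this layout and the function name predict_labels are illustrative choices, not part of the original framework.

```python
from typing import List


def predict_labels(scores: List[List[List[float]]]) -> List[List[int]]:
    """Convert a score map scores[y][x][c] into a label map labels[y][x]
    by taking the maximally scoring class at each pixel."""
    return [
        [max(range(len(pixel)), key=pixel.__getitem__) for pixel in row]
        for row in scores
    ]


# Toy 2 x 2 image with N = 3 classes (values are made up for illustration).
scores = [
    [[0.1, 2.0, -1.0], [3.0, 0.5, 0.2]],
    [[0.0, 0.1, 0.9], [1.5, 1.4, -2.0]],
]
labels = predict_labels(scores)
print(labels)  # [[1, 0], [2, 0]]
```

More elaborate post-processing (e.g., a CRF on top of the score map) would replace this per-pixel argmax and is not sketched here.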

Training models with different levels of supervision requires defining appro¬ 
priate loss functions in each scenario. We begin by presenting two of the most 
commonly used in the literature. We then extend them to incorporate (1) our 
proposed point supervision and (2) a novel objectness prior. 

Full Supervision. When the class label is available for every pixel during 
training, the CNN is commonly trained by optimizing the sum of per-pixel 
cross-entropy terms. Let $I$ be the set of pixels in the image. Let $s_{ic}$ be the CNN 
score for pixel $i$ and class $c$. Let $S_{ic} = \exp(s_{ic}) / \sum_{k=1}^{N} \exp(s_{ik})$ be the softmax 
probability of class $c$ at pixel $i$. Given a ground truth map $G$ indicating that 
pixel $i$ belongs to class $G_i$, the loss on a single training image is: 

$$\mathcal{L}_{pix}(S, G) = -\sum_{i \in I} \log(S_{iG_i}) \tag{1}$$

The loss is simply zero for pixels where the ground truth label is not defined 
(e.g., in the case of pixels defined as “difficult” on the boundary of objects in 
PASCAL VOC [32]). 
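
As a minimal sketch of the fully supervised loss, the following pure-Python code computes the softmax probabilities and sums the per-pixel cross-entropy terms, skipping pixels without a defined label. The flat list-of-pixels layout, the use of None for undefined labels, and the function names are assumptions made for illustration.

```python
import math
from typing import List, Optional


def softmax(scores: List[float]) -> List[float]:
    """Softmax over the class scores of one pixel (shifted for stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]


def pixel_loss(scores: List[List[float]], labels: List[Optional[int]]) -> float:
    """Sum of -log S_{i,G_i} over pixels; a label of None contributes zero."""
    total = 0.0
    for s_i, g_i in zip(scores, labels):
        if g_i is None:  # undefined ground truth, e.g. a "difficult" pixel
            continue
        total -= math.log(softmax(s_i)[g_i])
    return total


# Two pixels, two classes; the second pixel has no defined label.
loss = pixel_loss([[0.0, 0.0], [1.0, 1.0]], [0, None])
print(loss)  # log(2), since the first pixel is split evenly between classes
```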

Image-Level Supervision. In this case, the only information available during 
training are the sets $L \subseteq \{1, \ldots, N\}$ of classes present in the image and 
$L' \subseteq \{1, \ldots, N\}$ of classes not present in the image. The CNN model can be 
trained with a different cross-entropy loss: 

$$\mathcal{L}_{img}(S, L, L') = -\frac{1}{|L|} \sum_{c \in L} \log(S_{t_c c}) - \frac{1}{|L'|} \sum_{c \in L'} \log(1 - S_{t_c c}), \quad \text{with } t_c = \arg\max_{i \in I} S_{ic} \tag{2}$$

The first part of Eqn. (2), corresponding to $c \in L$, is used in [23]. It encourages 
each class in $L$ to have a high probability on at least one pixel in the image. 
The second part has been added in [6], corresponding to the fact that no pixels 
should have high probability for classes that are not present in the image. 
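
The two parts of the image-level loss can be sketched as follows. The code assumes the softmax probabilities are already available as probs[i][c] for a flat list of pixels; the function name and the small clamping constant eps (added only to avoid log(0)) are illustrative assumptions.

```python
import math
from typing import List, Set


def image_level_loss(probs: List[List[float]], present: Set[int],
                     absent: Set[int], eps: float = 1e-12) -> float:
    """Image-level loss: for each class, only its peak pixel t_c contributes."""
    def peak(c: int) -> float:
        # S_{t_c, c}: the highest probability of class c over all pixels.
        return max(p[c] for p in probs)

    loss = 0.0
    if present:  # classes in L should be confident on at least one pixel
        loss -= sum(math.log(max(peak(c), eps)) for c in present) / len(present)
    if absent:   # classes in L' should not be confident on any pixel
        loss -= sum(math.log(max(1.0 - peak(c), eps)) for c in absent) / len(absent)
    return loss


# Two pixels, three classes; classes 0 and 1 are present, class 2 is absent.
probs = [[0.7, 0.2, 0.1], [0.1, 0.8, 0.1]]
img_loss = image_level_loss(probs, present={0, 1}, absent={2})
print(img_loss)
```

Because each class contributes through a single peak pixel, gradients from this loss reach only a handful of pixels per image, which is one way to see why image-level training struggles to recover object extent.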

Point-Level Supervision. We study the intermediate case where the object 
classes are known for a small set of supervised pixels $I_s$, whereas other pixels 
are just known to belong to some class in $L$. We generalize Eqns. (1) and (2) to: 

$$\mathcal{L}_{point}(S, G, L, L') = \mathcal{L}_{img}(S, L, L') - \sum_{i \in I_s} \alpha_i \log(S_{iG_i}) \tag{3}$$

Here, $\alpha_i$ determines the relative importance of each supervised pixel. We 
experiment with several formulations for $\alpha_i$: (1) for each class we ask the user to 
either determine that the class is not present in the image or to point to one 
object instance. In this case, $|I_s| = |L|$ and $\alpha_i$ is uniform for every point; (2) 
we ask multiple annotators to do the same task as (1), and we set $\alpha_i$ to be the 
confidence of the accuracy of the annotator that provided the point; (3) we ask 
the annotator(s) to point to every instance of the classes in the image, and $\alpha_i$ 
corresponds to the order of the points: the first point is more likely to correspond 
to the largest object instance and thus deserves a higher weight $\alpha_i$. 
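
The point term and two of the weighting schemes can be sketched as below. The code assumes softmax probabilities probs[i][c] over a flat list of pixels and takes the image-level loss as a precomputed number; the dictionary-based representation of supervised pixels and all function names are illustrative assumptions. The 1/n and 1/r weights follow the values used in the experiments of Sections 5.2 and 5.3.

```python
import math
from typing import Dict, List


def uniform_weights(points: Dict[int, int]) -> Dict[int, float]:
    """alpha_i = 1/n for n supervised pixels on the image."""
    n = len(points)
    return {i: 1.0 / n for i in points}


def rank_weights(ordered_pixels: List[int]) -> Dict[int, float]:
    """alpha_i = 1/r for the r-th clicked instance of one class."""
    return {i: 1.0 / r for r, i in enumerate(ordered_pixels, start=1)}


def point_loss(img_loss: float, probs: List[List[float]],
               points: Dict[int, int], alpha: Dict[int, float]) -> float:
    """Image-level loss plus the weighted cross-entropy on supervised pixels.

    points maps a pixel index i to its annotated class G_i."""
    point_term = -sum(alpha[i] * math.log(probs[i][g]) for i, g in points.items())
    return img_loss + point_term


probs = [[0.5, 0.5], [0.9, 0.1], [0.25, 0.75]]
points = {0: 0, 2: 1}  # pixel 0 is class 0, pixel 2 is class 1
total = point_loss(0.0, probs, points, uniform_weights(points))
print(total)
```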

Objectness Prior. One issue with training models with very few or no 
supervised pixels is correctly inferring the spatial extent of the objects. In general, 
weakly supervised methods are prone to local minima: focusing on only a small 
part of the target object, or predicting all pixels as belonging to the background 
class [23]. To alleviate this problem, we introduce an additional term in our 
training objective based on an objectness prior (Fig. 2). Objectness provides a 
probability for whether each pixel belongs to any object class [30] (e.g., bird, car, 
sheep), as opposed to background (e.g., sky, water, grass). These probabilities 
have been used in the weakly supervised semantic segmentation literature before 
as unary potentials in graphical models [20] or during inference following a CNN 
segmentation [25]. To the best of our knowledge, we are the first to incorporate 
them directly into CNN training. 

Let $P_i$ be the probability that pixel $i$ belongs to an object. Let $O$ be the 
classes corresponding to objects, with the other classes corresponding to 
backgrounds. In PASCAL VOC, $O$ is the 20 object classes, and there is a single 
generic background class. We define a new loss: 

$$\mathcal{L}_{obj}(S, P) = -\frac{1}{|I|} \sum_{i \in I} \left( P_i \log\Big(\sum_{c \in O} S_{ic}\Big) + (1 - P_i) \log\Big(1 - \sum_{c \in O} S_{ic}\Big) \right) \tag{4}$$

At pixels with high $P_i$ values, this objective encourages placing probability mass 
on object classes. Alternatively, when $P_i$ is low, it prefers mass on the background 
class. Note that $\mathcal{L}_{obj}$ requires no human supervision (beyond pre-training the 
generic objectness detector), and thus can be combined with any loss above. 
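
A minimal sketch of such an objectness term is given below. It assumes the loss is a per-pixel binary cross-entropy between the prior and the total probability mass placed on object classes, averaged over pixels, which is consistent with the behaviour described above; the function name, the flat pixel layout, and the clamping constant eps are illustrative assumptions.

```python
import math
from typing import List, Sequence


def objectness_loss(probs: List[List[float]], prior: List[float],
                    object_classes: Sequence[int], eps: float = 1e-12) -> float:
    """Average binary cross-entropy between the objectness prior P_i and
    the softmax mass that pixel i places on the object classes."""
    total = 0.0
    for s_i, p_i in zip(probs, prior):
        obj_mass = sum(s_i[c] for c in object_classes)
        total += p_i * math.log(max(obj_mass, eps))
        total += (1.0 - p_i) * math.log(max(1.0 - obj_mass, eps))
    return -total / len(probs)


# Class 0 is background, class 1 is the only object class in this toy case.
probs = [[0.2, 0.8], [0.9, 0.1]]
prior = [1.0, 0.0]  # first pixel surely object, second surely background
obj_loss = objectness_loss(probs, prior, object_classes=[1])
print(obj_loss)
```

The term never looks at which object class receives the mass, so it can shape object extent without interfering with the class evidence supplied by image-level or point-level labels.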

4 Crowdsourcing Annotation Data 

In this section, we describe our method for collecting annotations for the different 
levels of supervision. The annotation time required for point-level and squiggle- 
level supervision was measured directly during data collection. For other types 
of supervision, we rely on the annotation times reported in the literature. 

Image-Level Supervision (20.0 sec/img). Collecting image-level labels 
takes 1 second per class [26]. Thus, annotating an image with 20 object classes 
in PASCAL VOC is expected to take 20 seconds per image. 

Full Supervision (239.7 sec/img). There are 1.5 object classes per image 
on average in PASCAL VOC 2012 [32]. It takes 1 second to annotate every 
object class that is not present (to obtain an image-level "no" label), for 18.5 
seconds of labeling time. Additionally, there are 2.8 object instances on average per 
image that need to be segmented [32]. The authors of the COCO dataset report 
22 worker hours for 1,000 segmentations. This implies a mean labeling time 
of 79 seconds per object segmentation, adding 2.8 x 79 seconds of labeling in our 
case. Thus, the total expected annotation time is 18.5 + 2.8 x 79 = 239.7 seconds 
per image. 

4.1 Point-Level Supervision (22.1 sec/img) 

We used Amazon Mechanical Turk (AMT) to annotate point-level supervision on 
20 PASCAL VOC object classes over 12,031 images: all training and validation 
images of the PASCAL VOC 2012 segmentation task [32] plus the additional 
images of [39]. Fig. 3 (left) shows the annotation interface and Fig. 3 (center) 
shows some collected data. We use two different point-level supervision tasks. 
For each image, we obtain either (1) one annotated point per object class, on 
the first instance of the class the annotator sees (1Point), or (2) one annotated 
point per object instance (AllPoints). We make these collected annotations and 
the annotation system publicly available. 

Annotation Time. There are 1.5 classes on average per image in PASCAL 
VOC 2012. It takes workers a median of 2.4 seconds to click on the first instance 
of an object. Therefore, the labeling time of 1Point is 1 x 18.5 + 1.5 x 2.4 = 22.1 
seconds per image. It takes workers a median of 0.9 seconds to click on every 
additional instance of an object class. There are 2.8 instances on average per 
image, thus the labeling time of AllPoints is 
1 x 18.5 + 1.5 x 2.4 + (2.8 - 1.5) x 0.9 = 23.3 seconds per image. Note that point 
supervision is only 1.1-1.2x more time-consuming than obtaining image-level 
labels, and more than 10x cheaper than full supervision. 

Fig. 3. Left. AMT annotation UI for point-level supervision. Center. Example points 
collected. Right. Example squiggles collected. Colors correspond to different classes. 

Quality Control. Quality control for point annotation was done by planting 
10 evaluation images in a 50-image task and ensuring that at least 8 are labeled 
correctly. We consider a point correct if it falls inside a tight bounding box 
around the object. For the AllPoints task, the number of annotated clicks must 
be at least the number of known object instances. 

Error Rates. Simply determining the presence or absence of an object class 
in an image was fairly easy, and workers incorrectly labeled an object class as 
absent only 1.0% of the time. On the 1Point task, 7.2% of points were on a 
pixel with a different class label (according to the PASCAL ground truth), and 
an additional 0.8% were on an unclassified "difficult" pixel. For comparison, much 
higher average error rates of 25% have been reported when drawing bounding boxes. 
Our collected data is high-quality, confirming that pointing to objects comes 
naturally to humans. 

Annotators had more difficulty with the AllPoints task: 7.9% of ground 
truth instances were left unannotated, 14.8% of the clicks were on the wrong 
object class, and 1.6% on “difficult” pixels. This task caused some confusion 
among workers due to blurry or very small instances; for example, many of these 
instances are not annotated in the ground truth but were clicked by workers, 
accounting for the high false positive rate. 


4.2 Squiggle-Level Supervision (34.9 sec/img) 

Prior work has experimented with training with free-form squiggles, where a subset 
of pixels are labeled. While one approach simulates squiggles by randomly labeling 
superpixels from the ground truth, we follow [18] in collecting squiggle annotations 
(and annotation times) from humans for 20 object classes on all PASCAL VOC 
2012 trainval images. This allows us to properly compare this supervision setting 
to human points. We extend the user interface shown in Fig. 3 (left) by asking 
annotators to draw one squiggle on one instance of the target class. Fig. 3 (right) 
shows some collected data. 

Annotation Time. As before, it takes 18.5 seconds to annotate the classes 
not present in the image. For every class that is present, it takes 10.9 seconds 
to draw a free-form squiggle on the target class. Therefore, the labeling time 
of 1Squiggle is 18.5 + 1.5 x 10.9 = 34.9 seconds per image. This is 1.6x more 
time-consuming than obtaining 1Point point-level supervision and 1.7x more 
than image-level labels. 
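
The per-image annotation times quoted in this section follow from a handful of averages, and the arithmetic can be reproduced with the short sketch below. All constants are the values stated in the text; the variable and function names are illustrative.

```python
NUM_CLASSES = 20             # PASCAL VOC object classes
CLASSES_PER_IMAGE = 1.5      # average classes present per image
INSTANCES_PER_IMAGE = 2.8    # average object instances per image
SEC_PER_CLASS_LABEL = 1.0    # one image-level yes/no label
SEC_PER_SEGMENTATION = 79.0  # one full object segmentation
SEC_FIRST_CLICK = 2.4        # median time to click the first instance
SEC_EXTRA_CLICK = 0.9        # median time for each additional instance
SEC_PER_SQUIGGLE = 10.9      # one free-form squiggle


def annotation_costs() -> dict:
    """Expected annotation time in seconds per image for each regime."""
    # Time spent saying "no" to the classes that are absent (18.5 s).
    absent = (NUM_CLASSES - CLASSES_PER_IMAGE) * SEC_PER_CLASS_LABEL
    one_point = absent + CLASSES_PER_IMAGE * SEC_FIRST_CLICK
    extra = (INSTANCES_PER_IMAGE - CLASSES_PER_IMAGE) * SEC_EXTRA_CLICK
    return {
        "image_level": NUM_CLASSES * SEC_PER_CLASS_LABEL,
        "1point": one_point,
        "allpoints": one_point + extra,
        "1squiggle": absent + CLASSES_PER_IMAGE * SEC_PER_SQUIGGLE,
        "full": absent + INSTANCES_PER_IMAGE * SEC_PER_SEGMENTATION,
    }


costs = annotation_costs()
for name, sec in costs.items():
    ratio = sec / costs["image_level"]
    print(f"{name:12s} {sec:7.2f} s/img  ({ratio:.2f}x image-level)")
```

The unrounded values are about 22.1, 23.27, 34.85, and 239.7 seconds, which round to the figures reported above.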

Error Rates. We used similar quality control to point-level supervision. 
Only 6.3% of the annotated pixels were on the wrong object class, and an 
additional 1.4% were on pixels marked as "difficult" in PASCAL VOC [32]. 

In Section 5 we compare the accuracy of the models trained with different 
levels of supervision. 

5 Experiments 

We empirically demonstrate the efficiency of our point-level supervision and 
objectness prior. We compare these forms of supervision against image-level 
labels, squiggle-level, 
and fully supervised data. We conclude that point-level supervision makes a 
much more efficient use of annotator time, and produces much more effective 
models under a fixed time budget. 


5.1 Setup 

Dataset. We train and evaluate on the PASCAL VOC 2012 segmentation 
dataset [32] augmented with extra annotations from [39]. There are 10,582 
training images, 1,449 validation images and 1,456 test images. We report the mean 
intersection over union (mIOU), averaged over 21 classes. 
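
For reference, a minimal sketch of the mIOU metric is given below. It assumes predictions and ground truth are flat lists of class indices pooled over the evaluation set, that pixels with an ignored label are skipped, and that classes appearing in neither list are left out of the average; these conventions and the function name are assumptions, not a description of the official evaluation code.

```python
from typing import List, Optional


def mean_iou(pred: List[int], gt: List[int], num_classes: int,
             ignore: Optional[int] = None) -> float:
    """Mean over classes of |pred AND gt| / |pred OR gt|."""
    inter = [0] * num_classes
    union = [0] * num_classes
    for p, g in zip(pred, gt):
        if g == ignore:
            continue
        if p == g:
            inter[g] += 1
            union[g] += 1
        else:
            # A mismatch enlarges the union of both classes involved.
            union[p] += 1
            union[g] += 1
    ious = [i / u for i, u in zip(inter, union) if u > 0]
    return sum(ious) / len(ious)


# Class 0: IOU 1/2, class 1: IOU 2/3.
score = mean_iou(pred=[0, 0, 1, 1], gt=[0, 1, 1, 1], num_classes=2)
print(score)
```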

CNN Architecture. We use the state-of-the-art fully convolutional network 
model [5]. Briefly, the architecture is based on the VGG 16-layer net [8], with all 
fully connected layers converted to convolutional layers. The last classifier layer 
is discarded and replaced with a 1x1 convolution layer with channel dimension 
N = 21, equal to the number of classes (20 object classes plus background). The 
final modification is the addition of a deconvolution layer to bilinearly upsample 
the output to pixel-level dense predictions. 

CNN Training. We train following a procedure similar to [5]. We use 
stochastic gradient descent with a fixed learning rate of 10“^, doubling the learn¬ 
ing rate for biases, and with a minibatch of 20 images, momentum of 0.9 and 
weight decay 0.0005. The network is initialized with weights pre-trained for a 
1000-way classification task of the ILSVRC 2012 dataset [5,7,8].^1 In the fully 
supervised case we zero-initialize the classifier weights [5], and for all the weakly 
supervised cases we follow [23] to initialize them with weights learned by the 
original VGG network for classes common to both PASCAL and ILSVRC. We 
backpropagate through all layers to fine-tune the network, and train for 50,000 
iterations. We build directly on the publicly available implementation of [5].^2 

^1 Standard in the literature [1,4,5,23,25,37]. We do not consider the cost of collecting 
those annotations; including them would not change our overall conclusions. 

Objectness prior. We calculate the per-pixel objectness prior by assigning 
each pixel the average objectness score of all windows containing it. These scores 
are obtained by using the pre-trained model from the released code of [30] . The 
model is trained on 50 images with 291 object instances randomly sampled 
from a variety of different datasets (e.g., INRIA Person, Caltech 101) that do 
not overlap with PASCAL VOC 2007-2012 [30]. For fairness of comparison, we 
include the annotation cost of training the objectness prior. We estimate the 291 
bounding boxes took 10.2 seconds each on average to obtain, for 49.5 minutes 
of annotation. Amortized across the 10,582 PASCAL training images, using the 
objectness prior thus costs 0.28 seconds of extra annotation per image. 
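
The window-averaging step and the amortized cost can be sketched as follows. The window format (x0, y0, x1, y1, score) with inclusive pixel coordinates, the value 0.0 for pixels covered by no window, and the function name are hypothetical choices for illustration; they do not describe the interface of the released objectness code of [30].

```python
from typing import List, Tuple

Window = Tuple[int, int, int, int, float]  # (x0, y0, x1, y1, score), inclusive


def pixel_objectness(width: int, height: int,
                     windows: List[Window]) -> List[List[float]]:
    """Assign each pixel the average score of all windows containing it."""
    total = [[0.0] * width for _ in range(height)]
    count = [[0] * width for _ in range(height)]
    for x0, y0, x1, y1, score in windows:
        for y in range(max(y0, 0), min(y1, height - 1) + 1):
            for x in range(max(x0, 0), min(x1, width - 1) + 1):
                total[y][x] += score
                count[y][x] += 1
    # Pixels covered by no window get 0.0 (an assumption of this sketch).
    return [
        [total[y][x] / count[y][x] if count[y][x] else 0.0 for x in range(width)]
        for y in range(height)
    ]


prior = pixel_objectness(3, 2, [(0, 0, 1, 1, 0.8), (1, 0, 2, 0, 0.4)])

# Amortized annotation cost of the prior, using the figures from the text.
total_seconds = 291 * 10.2              # about 49.5 minutes of box drawing
per_image = total_seconds / 10582       # about 0.28 seconds per training image
print(round(total_seconds / 60, 1), round(per_image, 2))
```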


5.2 Synergy Between Point-Level Supervision and Objectness Prior 

We first establish the baselines of our model and show the benefits of both 
point-level supervision and objectness prior. Table 1 (top) summarizes our findings and 
Table 2 (top) shows the per-class accuracy breakdown. 

Baseline. We train a baseline segmentation model from image-level labels 
with no additional information. We base our model on [23], which trains a similar 
fully convolutional network and obtains 25.1% mIOU on the PASCAL VOC 2011 
validation set. We notice that the absence of a class label in an image is also an 
important supervisory signal, along with the presence of a class label, as in [6]. 
We incorporate this insight into our loss function $\mathcal{L}_{img}$ in Eqn. (2) and see a 
substantial 5.4% improvement in mIOU from the baseline, when evaluated on 
the PASCAL VOC 2011 validation set. 

Effect of Point-Level Supervision. We now run a key experiment to 
investigate how having just one annotated point per class per image improves 
semantic segmentation accuracy. We use loss $\mathcal{L}_{point}$ of Eqn. (3). On average there 
are only 1.5 supervised pixels per image (as many as classes per image). All other 
pixels are unsupervised. We set $\alpha_i = 1/n$ where $n$ is the number of supervised 
pixels on a particular training image. On the PASCAL VOC 2012 validation set, 
the accuracy of a model trained using $\mathcal{L}_{img}$ is 29.8% mIOU. Adding our point 
supervision improves accuracy by 5.3% to 35.1% mIOU (row 3 in Table 1). 

Effect of Objectness Prior. One issue with training models with very few 
or no supervised pixels is the difficulty of inferring the full extent of the object. 
With image-level labels, the model tends to learn that objects occupy a much 
greater area than they actually do (second column of Fig. 4). We introduce the 
objectness prior in the loss using Eqn. (4) to aid the model in correctly predicting 
the extent of objects (third column of Fig. 4). This improves segmentation 
accuracy: when supervised only with image-level labels, the Img model obtained 
29.8% mIOU, and the Img + Obj model improves to 32.2% mIOU. 

^2 [5] introduces additional refinement by decreasing the stride of the output layers 
from 32 pixels to 8 pixels, which improves their results from 59.7% to 62.7% mIOU 
on the PASCAL VOC 2011 validation set. We use the original model with stride of 
32 for simplicity. 

[Figure 4 columns: original image; image-level supervision; image-level + objectness; 
point-level + objectness; full supervision. Legend: background, car, dog, horse, 
motorbike, person, sheep.] 

Fig. 4. Qualitative results on the PASCAL VOC 2012 validation set. The model trained 
with image-level labels usually predicts the correct classes and their general locations, 
but it over-extends the segmentations. The objectness prior improves the accuracy of 
the image-level model by helping infer the object extent. Point supervision aids in 
separating distinct objects (row 2) and classes (row 4) and helps correctly localize the 
objects (rows 3 and 4). Best viewed in color. 

Effect of Combining Point-Level Supervision and Objectness. The 
effect of the objectness prior is even more apparent when used together with 
point-level supervision. When supervised with 1Point, the Img model achieves 
35.1% mIOU, and the Img + Obj model improves to 42.7% mIOU (rows 3 and 
4 in Table 1). Conversely, when starting from the Img + Obj image-level model, 
the effect of a single point of supervision is stronger. Adding just one point per 
class improves accuracy by 10.5% from 32.2% to 42.7%. 

Conclusions. We draw two conclusions. First, the objectness prior is very 
effective for training these models with no or very few supervised pixels, and 
this comes with no additional human supervision cost on the target dataset. For 
the rest of the experiments, whenever not all pixels are labeled (i.e., all but full 
supervision) we always use Img + Obj together. Second, our two contributions 
operate in synergistic ways. The combined effect of both point-level supervision 
and objectness prior is a 12.9% improvement (from 29.8% to 42.7% mIOU). 

Supervision                   Time (s)   Model       mIOU (%)

Image-level labels            20.0       Img         29.8
Image-level labels            20.3       Img + Obj   32.2
1Point                        22.1       Img         35.1
1Point                        22.4       Img + Obj   42.7

AllPoints                     23.6       Img + Obj   42.7
AllPoints (weighted)          23.5       Img + Obj   43.4
1Point (3 annotators)         29.6       Img + Obj   43.8
1Point (random annotators)    22.4       Img + Obj   42.8 - 43.8
1Point (random points)        240        Img + Obj   46.1

Full supervision              239.7      Img         58.3
Hybrid approach               24.5       Img + Obj   53.1
1 squiggle per class          35.2       Img + Obj   49.1

Table 1. Results on the PASCAL VOC 2012 validation set, including both annotation 
time (second column) and accuracy of the model (last column). Top, middle and bottom 
correspond to Sections 5.2, 5.3 and 5.4 respectively. 


5.3 Point-Level Supervision Variations 

Our goal in this section is to build a deeper understanding of the properties of 
point-level supervision that make it an advantageous form of supervision. Table 1 
(middle) summarizes our findings and Table 2 (bottom) shows the per-class 
accuracy breakdown. 

Multiple Instances. Using points on all instances (AllPoints) instead of 
just one point per class (1Point) remains at 42.7% mIOU: the benefit from extra 
supervision is offset by the confusion introduced by some difficult instances that 
are annotated. We introduce a weighting factor $\alpha_i = 1/r$ in Eqn. (3), where $r$ 
is the ranked order of the point (so the first instance of a class gets weight 1, 
the second instance gets weight 1/2, etc.). This AllPoints (weighted) method 
improves results by a modest 0.7% to 43.4% mIOU. 

Patches. The segmentation model effectively enforces spatial label 
smoothness, so increasing the area of supervised pixels by a radius of 2, 5 and 25 pixels 
around a point has little effect, with 43.0-43.1% mIOU (not shown in Table 1). 

Multiple Annotators. We also collected 1Point data from 3 different 
annotators and used all points during training. This achieved a modest improvement 
of 1.1% from 42.7% to 43.8%, which does not seem worth the additional 
annotation cost (29.3 versus 22.1 seconds per image). 

Random Annotators. Using the data from multiple annotators, we also 
ran experiments to estimate the effect of human variance on the accuracy of 
the model. For each experiment, we randomly selected a different independent 
annotator to label each image. Three runs achieved 42.8, 43.4, and 43.8 mIOU 
respectively, as compared to our original result of 42.7 mIOU. This suggests 
that the variation in the location of the annotators' points does not significantly 
affect our results. This also further confirms that humans are predictable and 
consistent in pointing to objects [3,28]. 

Random Points. An interesting experiment is supervising with one point 
per class, but randomly sampled on the target object class using per-pixel 
supervised ground truth annotations (instead of asking humans to click on the object). 
This improved results over the human points by 3.4%, from 42.7% to 46.1%. This 
is due to the fact that humans are predictable and consistent in pointing [3,28], 
which reduces the variety in point-level supervision across instances. 

Model                   bg aer bic bir boa bot bus car cat cha cow din dog hor mot per pot she sof tra  tv avg

Img                     60  25  15  23  21  20  48  36  47   9  34  21  37  32  37  18  24  34  21  40  24  30
Img + Obj               79  42  20  39  33  17  34  39  45  10  35  13  42  34  33  23  19  40  15  38  28  32
Img + 1Point            56  25  16  22  20  31  53  34  53   8  41  42  43  40  42  46  24  38  29  46  30  35
Img + 1Point + Obj      78  49  23  37  37  37  57  50  51  14  40  41  50  38  51  47  31  48  28  49  45  43

AllPoints               79  49  21  40  38  38  50  45  53  17  43  40  47  44  51  51  22  47  29  52  44  43
AllPoints (weighted)    77  48  23  38  36  38  57  52  52  13  42  41  50  43  52  46  31  49  28  50  44  43
1Point (3 annot.)       79  50  23  39  37  39  60  50  54  15  41  42  49  42  52  50  29  49  29  49  44  44
1Point (random)         80  49  23  39  41  46  60  61  56  18  38  41  54  42  55  57  32  51  26  55  45  46

Table 2. Per-class segmentation accuracy (%) on the PASCAL VOC 2012 validation 
set. (Top) Models trained with image-level, point supervision and (optionally) an 
objectness prior described in Section 5.2. (Bottom) Models supervised with variations of 
point-level supervision described in Section 5.3. 

5.4 Incorporating Stronger Supervision 

Hybrid Approach with Points and Full Supervision. A fully supervised
segmentation model achieves 58.3% mIOU at a cost of 239.7 seconds per image;
recall that a point-level supervised model achieves 42.7% at a cost of
22.4 seconds per image. We explore the idea of combining the benefits of the
high accuracy of full supervision with the low cost of point-level supervision.
We train a hybrid segmentation model with a combination of a small number
of fully-supervised images (100 images in this experiment), and a large number
of point-supervised images (the remaining 10,482 images in PASCAL VOC
2012). This model achieves 53.1% mIOU, a significant 10.4% increase in accuracy
over the 1Point model, falling only 5.2% behind full supervision. This
suggests that the first few fully-supervised images are very important for learning
the extent of objects, but afterwards, point-level supervision is quite effective
at providing the location of object classes. Importantly, this hybrid model
maintains a low annotation time, at an average of only 24.5 seconds per image:
(100 × 239.7 + 10,482 × 22.4) / (100 + 10,482) = 24.5 seconds, which is 9.8× cheaper
than full supervision. We will further explore the tradeoffs between annotation
cost and accuracy in Section 5.5.
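The average cost of the hybrid scheme is a weighted mean of the two per-image annotation times. A minimal sketch of that arithmetic, using only the figures quoted above (the function name and constants are illustrative, not part of any released code):

```python
# Per-image annotation times in seconds, as quoted in the text.
FULL_SECONDS = 239.7    # full per-pixel supervision
POINT_SECONDS = 22.4    # point-level supervision


def mean_annotation_seconds(n_full, n_point,
                            full_seconds=FULL_SECONDS,
                            point_seconds=POINT_SECONDS):
    """Average annotation time per image for a mixed annotation set."""
    total = n_full * full_seconds + n_point * point_seconds
    return total / (n_full + n_point)


# 100 fully supervised images plus the remaining 10,482 point-supervised ones.
hybrid_seconds = mean_annotation_seconds(100, 10_482)
speedup = FULL_SECONDS / hybrid_seconds

print(f"{hybrid_seconds:.1f} s per image, "
      f"{speedup:.1f}x cheaper than full supervision")
```

Changing the first argument shows how quickly the average cost grows as more images receive full annotation.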

Squiggles. Free-form squiggles are a natural extension of points towards
stronger supervision. Squiggle-level supervision annotates a larger number of
pixels: we collect an average of 502.7 supervised pixels per image with squiggles,
vs. 1.5 with 1Point. Like points, squiggles provide a good tradeoff between
accuracy and annotation cost. The squiggle-supervised model achieves 16.9%
higher mIOU than image-level labels and 6.4% higher mIOU than 1Point, at
only 1.6-1.7× the cost. However, squiggle-level supervision falls short of the
hybrid approach on both annotation time and accuracy: squiggle-level annotation
takes 35.2 seconds per image compared to 24.5 seconds for hybrid, and the
squiggle-level model achieves only 49.1% mIOU compared to 53.1% mIOU with hybrid.
This suggests that hybrid supervision, combining large-scale point-level annotations
with full annotation on a handful of images, is a better annotation strategy
than squiggle-level annotation.

Supervision                    mIOU (%)
Full (883 imgs)                22.1
Image-level (10,582 imgs)      29.8
Squiggle-level (6,064 imgs)    40.2
Point-level (9,576 imgs)       42.9

Table 3. Accuracy of models on the PASCAL VOC 2012 validation set given a
fixed budget (and number of images annotated within that budget). Point-level
supervision provides the best tradeoff between annotation time and accuracy.
Details in Section 5.5.

Fig. 5. Results without resource constraints on the PASCAL VOC 2012 test
set. The x-axis is log-scale.


5.5 Segmentation Accuracy on a Budget 

Fixed Budget. Given a fixed annotation time budget, what is the right strategy
to obtain the best semantic segmentation model possible? We investigate the
problem by fixing the total annotation time to the 10,582 × 20.3 seconds = 60 hours
that it would take to annotate all 10,582 training images with image-level
labels. For each supervision method, we then compute the number of images N
that it is possible to label in that amount of time, randomly sample N images
from the training set, use them to train a segmentation model, and measure the
resulting accuracy on the validation set. Table 3 reports both the number of
images N and the resulting accuracy of the fully supervised (22.1% mIOU),
image-level supervised (29.8% mIOU), squiggle-level supervised (40.2% mIOU) and
point-level supervised (42.9% mIOU) models. Point-level supervision outperforms
the other types of supervision on a fixed budget, providing the best
tradeoff between annotation time and resulting segmentation accuracy.
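The fixed-budget procedure can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: it uses the rounded per-image times quoted in this section (20.3, 22.4, 35.2 and 239.7 seconds), stored in tenths of a second so the arithmetic stays exact, and the helper names and the seed are hypothetical.

```python
import random

# Per-image annotation cost in tenths of a second (rounded values from the text).
COST_DECISECONDS = {
    "image-level": 203,
    "point-level": 224,
    "squiggle-level": 352,
    "full": 2397,
}
TRAIN_IMAGES = 10_582

# Budget: time needed to label every training image with image-level labels.
BUDGET_DECISECONDS = TRAIN_IMAGES * COST_DECISECONDS["image-level"]


def images_within_budget(cost, budget=BUDGET_DECISECONDS):
    """Largest N such that N images at `cost` each fit in the budget."""
    return min(TRAIN_IMAGES, budget // cost)


def sample_training_subset(n, seed=0):
    """Randomly pick n distinct training-image indices (seed is illustrative)."""
    return random.Random(seed).sample(range(TRAIN_IMAGES), n)


counts = {name: images_within_budget(c) for name, c in COST_DECISECONDS.items()}
budget_hours = BUDGET_DECISECONDS / 10 / 3600
subset = sample_training_subset(counts["full"])

print(f"budget: {budget_hours:.1f} hours")
for name, n in counts.items():
    print(f"{name}: N = {n}")
```

With these rounded timings the sketch yields 9,589 point-level, 6,102 squiggle-level and 896 fully supervised images, close to but not identical with the counts in Table 3; the exact per-image times behind the table are not recoverable from this section, so the code illustrates the procedure rather than reproducing the reported N.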

Comparisons to Others. For the rest of this section, we use a model trained
on all 12,031 training+validation images and evaluate on the PASCAL VOC 2012
test set (as opposed to the validation set above) to allow for fair comparison to
prior work. Point-level supervision (Img + 1Point + Obj) obtains 43.6% mIOU
on the test set. Fig. 5 shows the tradeoffs between annotation time and accuracy
of different methods, discussed below.

Unlimited Budget (Strongly Supervised). We compare both the annotation
time and accuracy of our point-supervised 1Point model with published
techniques with much larger annotation budgets, as a reference for what might
be achieved by our method if given more resources. Long et al. [5] report 62.2%
mIOU, Hong et al. [33] report 66.6% mIOU, and Chen et al. [37] report 71.6%
mIOU, but in the fully supervised setting that requires about 800 hours of
annotation, an order of magnitude more time-consuming than point supervision.
Future exploration will reveal whether point-level supervision would outperform
a fully supervised algorithm given 800 annotation hours of data.

Small Budget (Weakly Supervised). We also compare to weakly supervised
published results. Pathak et al. (ICLR) [23] achieve 25.7% mIOU, Pathak
et al. (ICCV) [6] achieve 35.6% mIOU, and Papandreou et al. [4] achieve 39.6%
mIOU with only image-level labels requiring approximately 67 hours of annotation
on the 12,031 images. Pinheiro et al. [25] achieve 40.6% mIOU
but with 400 hours of annotations (see footnote). We improve in accuracy upon all
of these methods and achieve 43.6% with point-level supervision requiring about 79
annotation hours. Note that our baseline model is a significantly simplified version
of [23,4]. Incorporating additional features of their methods is likely to further
increase our accuracy at no additional cost.
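The 400-hour figure attributed to [25] follows from the accounting in the footnote: one second per class per additional background image, added to the 67 hours of image-level labels. A small sketch of that arithmetic (variable names are illustrative):

```python
# Figures taken from the footnote on the annotation cost of [25].
BACKGROUND_IMAGES = 60_000
CLASSES = 20
SECONDS_PER_CLASS_LABEL = 1.0
IMAGE_LEVEL_HOURS = 67  # image-level labels on the PASCAL VOC images

# Extra time spent confirming that each background image contains no class.
extra_hours = BACKGROUND_IMAGES * CLASSES * SECONDS_PER_CLASS_LABEL / 3600
total_hours = extra_hours + IMAGE_LEVEL_HOURS

print(f"extra: {extra_hours:.0f} hours, total: {total_hours:.0f} hours")
```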

Size constraint. Finally, we compare against the recent work of [6] which 
trains with image-level labels but incorporates an additional bit of supervision 
in the form of object size constraints. They achieve 43.3% mlOU (omitting the 
CRF post-processing), on par with 43.6% using point-level supervision. This size 
constraint should be fast to obtain although annotation times are not reported. 
These two simple bits of supervision (point-level and size) are complementary 
and may be used together effectively in the future. 

6 Conclusions 

We propose a new time-efficient supervision approach for semantic image
segmentation based on humans pointing to objects. We show that this method
enables training more accurate segmentation models than other popular forms
of supervision when given the same annotation time budget. In addition, we
introduce an objectness prior directly in the loss function of our CNN to help infer
the extent of the object. We demonstrate the effectiveness of our approach by
evaluating on the PASCAL VOC 2012 dataset. We hope that future large-scale
semantic segmentation efforts will consider using the point-level supervision we
have proposed, building upon our released dataset and annotation interfaces.

Footnote: [25] trains with only image-level annotations but adds 700,000 additional
positive ImageNet images and 60,000 background images. We choose not to count the
700,000 freely available images but the additional 60,000 background images they
annotated would take an additional 60,000 × 20 classes × 1 second = 333 hours. The
total annotation time is thus 333 + 67 = 400 hours.